Effective Data Visualization

Steve Elston

01/10/2022

Solving Data Science Coding Problems

Correctly using programming environments is a core data science skill: Python, R, SQL, …

Effective Visualization for Exploration and Communications

Visualization is primarily a form of communication

Visualizing Large Complex Data is Difficult

Problem: Modern data sets are growing in size and complexity

Limitations of Scientific Graphics

All scientific graphics are limited to a 2-dimensional projection

Why is Perception Important?

Use Aesthetics to Improve Perception

Over-plotting

Over-plotting occurs when plot markers lie on top of one another.

Dealing with Over-plotting

What can we do about over-plotting?

Example of Overplotting

Use Transparency, Marker Size, Downsampling
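
The three remedies named above can be compared side by side. The following is a minimal sketch, assuming matplotlib and NumPy are available; the data are synthetic and the specific values of `s` and `alpha` are illustrative choices, not recommendations from the original slides.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 50_000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)  # synthetic, correlated data

# Downsampling: draw a random 10% subset without replacement
idx = rng.choice(n, size=5_000, replace=False)

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
axes[0].scatter(x, y)                            # default markers: heavy over-plotting
axes[0].set_title("Default markers")
axes[1].scatter(x, y, s=2, alpha=0.05)           # smaller, mostly transparent markers
axes[1].set_title("Small markers + transparency")
axes[2].scatter(x[idx], y[idx], s=4, alpha=0.3)  # plot only the random subset
axes[2].set_title("Random 10% subsample")
```

Call `plt.show()` or `fig.savefig(...)` to view the result. Transparency keeps every point but lets density show through, while downsampling trades completeness for speed and clarity.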

Alternatives to avoid over-plotting for truly large data sets

Hexbin Plot
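
A hexbin plot replaces individual markers with hexagonal bins colored by count. This is a minimal sketch, assuming matplotlib's `Axes.hexbin`; the data are synthetic and `gridsize=40` is an arbitrary illustrative choice.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.6 * x + rng.normal(scale=0.8, size=100_000)  # synthetic data

fig, ax = plt.subplots(figsize=(6, 5))
# mincnt=1 leaves empty hexagons blank instead of coloring them as zero
hb = ax.hexbin(x, y, gridsize=40, cmap="viridis", mincnt=1)
fig.colorbar(hb, ax=ax, label="points per bin")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

Because the color encodes a count, a sequential palette is a natural fit here.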

Contour Plot
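
One simple way to draw density contours is to bin the points on a regular grid and contour the bin counts. The sketch below takes that approach with `numpy.histogram2d`; a kernel density estimate is a common alternative. The data, the bin count, and the number of levels are all illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 0.6 * x + rng.normal(scale=0.8, size=100_000)  # synthetic data

# Count points in a 60 x 60 grid of rectangular bins
counts, xedges, yedges = np.histogram2d(x, y, bins=60)
xc = 0.5 * (xedges[:-1] + xedges[1:])  # bin centers along x
yc = 0.5 * (yedges[:-1] + yedges[1:])  # bin centers along y

fig, ax = plt.subplots(figsize=(6, 5))
# histogram2d indexes counts as [x_bin, y_bin]; contour expects Z[y, x], hence .T
cs = ax.contour(xc, yc, counts.T, levels=8, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
```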

Other Methods to Display Large Data Sets

Sometimes a creative alternative is best

Time Series of Box Plots

bivariate measures

joint plots

Organization of Plot Aesthetics

We can organize aesthetics by their effectiveness:

  1. Easy to perceive plot aesthetics: help most people gain understanding of data relationships

  2. Aesthetics with moderate perceptive power: useful properties to project data relationships when used sparingly

  3. Aesthetics with limited perceptive power: useful within strict limits

Properties of Common Aesthetics

| Property or Aesthetic | Perception | Data Types |
|---|---|---|
| Aspect ratio | Good | Numeric |
| Regression lines | Good | Numeric plus categorical |
| Marker position | Good | Numeric |
| Bar length | Good | Counts, numeric |
| Sequential color palette | Moderate | Numeric, ordered categorical |
| Marker size | Moderate | Numeric, ordered categorical |
| Line types | Limited | Categorical |
| Qualitative color palette | Limited | Categorical |
| Marker shape | Limited | Categorical |
| Area | Limited | Numeric or categorical |
| Angle | Limited | Numeric |

Aspect Ratio

\[\text{aspect ratio} = \frac{\text{width}}{\text{height}} : 1\]

Example of Changing Aspect Ratio

The longest-running scientific time series is the sunspot count:

##      YEAR  SUNACTIVITY
## 0  1700.0          5.0
## 1  1701.0         11.0
## 2  1702.0         16.0
## 3  1703.0         23.0
## 4  1704.0         36.0
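
To see how aspect ratio changes what a reader perceives, the same series can be drawn at two figure shapes. This is a minimal sketch, assuming matplotlib; the series is a synthetic sine wave with a period of 11 years standing in for the sunspot data, not the real counts.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for a long cyclic series (NOT the real sunspot counts)
years = np.arange(1700, 2009)
activity = 50 + 45 * np.sin(2 * np.pi * (years - 1700) / 11)

# Aspect ratio 1:1, the cycles are squeezed together and hard to read
fig1, ax1 = plt.subplots(figsize=(5, 5))
ax1.plot(years, activity)
ax1.set_title("aspect ratio 1:1")

# Aspect ratio 6:1, the individual cycles become easy to follow
fig2, ax2 = plt.subplots(figsize=(12, 2))
ax2.plot(years, activity)
ax2.set_title("aspect ratio 6:1")
```

Here `figsize=(width, height)` is given in inches, so the ratio of the two numbers is the aspect ratio of the whole figure.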

Sequential and Divergent Color Palettes

Use of color as an aesthetic in visualization is a complicated subject.

Auto Weight by Sequential Color Palette

Limits of color

Regardless of the approach there are some significant limitations

Marker Size

Marker size is a moderately effective aesthetic, useful for quantitative variables

Engine Size by Marker Size and Price by Sequential Color Palette
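
A plot of this kind maps two extra variables onto one scatter plot. The sketch below assumes matplotlib and uses synthetic values; the variable names (`engine_size`, `curb_weight`, `city_mpg`, `price`) are hypothetical stand-ins for automobile attributes, not the data set used in the original slides.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n = 150
# Hypothetical, synthetic automobile attributes
engine_size = rng.uniform(60, 320, size=n)
curb_weight = 1_500 + 8 * engine_size + rng.normal(scale=150, size=n)
city_mpg = 50 - 0.1 * engine_size + rng.normal(scale=3, size=n)
price = 4_000 + 110 * engine_size + rng.normal(scale=3_000, size=n)

fig, ax = plt.subplots(figsize=(7, 5))
sc = ax.scatter(
    city_mpg, curb_weight,
    s=engine_size,          # marker area encodes engine size
    c=price, cmap="Blues",  # sequential palette encodes price
    alpha=0.8, edgecolors="gray",
)
fig.colorbar(sc, ax=ax, label="price")
ax.set_xlabel("city mpg")
ax.set_ylabel("curb weight")
```

Note that `s` sets marker area in points squared, so perceived size grows more slowly than the raw value.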

Line Plots and Line Type

Line plots connect discrete, ordered data points with a line

Limits of Line Type

Marker Shape

Marker shape is useful for displaying categorical relationships, provided that:

  1. The number of categories is small
  2. Distinctive shapes are chosen for the markers

Aspiration by Marker Shape
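
With only two categories and clearly different shapes, marker shape stays readable. This is a minimal sketch, assuming matplotlib; the data are synthetic, and the `std`/`turbo` levels and the numeric columns are hypothetical stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 120
# Hypothetical, synthetic data: a two-level category plus two numeric columns
aspiration = rng.choice(["std", "turbo"], size=n, p=[0.8, 0.2])
engine_size = rng.uniform(60, 320, size=n)
price = (4_000 + 110 * engine_size
         + 4_000 * (aspiration == "turbo")
         + rng.normal(scale=3_000, size=n))

markers = {"std": "o", "turbo": "^"}  # two visually distinct shapes
fig, ax = plt.subplots()
for level, marker in markers.items():
    mask = aspiration == level
    ax.scatter(engine_size[mask], price[mask], marker=marker, label=level, alpha=0.7)
ax.set_xlabel("engine size")
ax.set_ylabel("price")
ax.legend(title="aspiration")
```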

higher dimensional data

In higher dimensions, things get more challenging, but still manageable up to a certain point (usually 5 or so dimensions)
- we can use aesthetics to add additional dimensions to visualizations, but we quickly run out of elements
- we can use faceting to break up a plot into many, but having too many plots to look at can be overwhelming

As dimensionality goes up, we need to rely on more advanced methods, but as we will learn later, there’s no such thing as a free lunch
- run an ML algorithm for dimensionality reduction
- use visualizations such as t-SNE that are designed for such situations
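
As one concrete instance of dimensionality reduction, the sketch below projects synthetic 6-dimensional data onto its first two principal components using NumPy's SVD. PCA is chosen here only because it is easy to write from scratch; it is an assumption, not necessarily the method the course uses later, and t-SNE would need an additional library.

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic 6-dimensional data that mostly varies along 2 latent directions
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T               # coordinates on the first 2 components
explained = S**2 / np.sum(S**2)      # fraction of variance per component

print("variance kept by 2 components:", explained[:2].sum())
```

The `scores` array can be drawn as an ordinary scatter plot. The cost is that whatever variance the dropped components carried is no longer visible.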

Aesthetics

Facet plots

Facet Plot with Weather by Season
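
A facet plot repeats the same axes once per level of a categorical variable. The following is a minimal sketch, assuming matplotlib; the weather values are synthetic and the per-season mean temperatures are hypothetical numbers used only to generate data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(11)
seasons = ["winter", "spring", "summer", "autumn"]
# Hypothetical mean temperatures (deg C), used only to generate synthetic data
mean_temp = {"winter": 2, "spring": 12, "summer": 24, "autumn": 13}

# One panel per season; shared axes make the panels directly comparable
fig, axes = plt.subplots(1, 4, figsize=(14, 3.5), sharex=True, sharey=True)
for ax, season in zip(axes, seasons):
    temp = rng.normal(loc=mean_temp[season], scale=4, size=90)
    humidity = 80 - 1.2 * temp + rng.normal(scale=6, size=90)
    ax.scatter(temp, humidity, s=10, alpha=0.6)
    ax.set_title(season)
    ax.set_xlabel("temperature")
axes[0].set_ylabel("humidity")
```

Sharing the x and y scales across panels is the key design choice: without it, differences between facets can come from the axes rather than the data.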

Correlation matrix
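
A correlation matrix can be displayed as a heat map with a divergent palette centered at zero. This is a minimal sketch, assuming NumPy and matplotlib; the four variables `a` to `d` are synthetic and hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n = 300
# Synthetic variables with known relationships to `a`
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)   # positively related to a
c = -0.5 * a + rng.normal(size=n)        # negatively related to a
d = rng.normal(size=n)                   # unrelated
names = ["a", "b", "c", "d"]

corr = np.corrcoef(np.vstack([a, b, c, d]))  # each row is one variable

fig, ax = plt.subplots()
# Divergent palette with symmetric limits so that zero maps to the neutral color
im = ax.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xticks(range(4))
ax.set_xticklabels(names)
ax.set_yticks(range(4))
ax.set_yticklabels(names)
fig.colorbar(im, ax=ax, label="Pearson correlation")
```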

Scatter plot matrix

Spurious correlations

Simpson’s paradox

Simpson’s paradox gives rise to false associations

Simpson’s paradox

Simpson’s paradox arises from a latent variable

Simpson’s paradox

With categorical data, Simpson’s paradox can occur when the relative size of the groups is different between the control and treatment
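
A small worked example makes the effect of unequal group sizes concrete. The counts below are made up for illustration: the treatment arm is mostly severe cases and the control arm is mostly mild cases.

```python
# Made-up counts for illustration: (recovered, total) per arm and severity
counts = {
    "treatment": {"mild": (9, 10), "severe": (45, 90)},
    "control": {"mild": (72, 90), "severe": (4, 10)},
}

def rate(recovered, total):
    """Recovery rate within one group."""
    return recovered / total

def pooled(arm):
    """Recovery rate for an arm, ignoring severity."""
    recovered = sum(r for r, _ in counts[arm].values())
    total = sum(n for _, n in counts[arm].values())
    return recovered / total

for severity in ("mild", "severe"):
    t = rate(*counts["treatment"][severity])
    c = rate(*counts["control"][severity])
    print(f"{severity:>7}: treatment {t:.0%} vs control {c:.0%}")

print(f" pooled: treatment {pooled('treatment'):.0%} vs control {pooled('control'):.0%}")
```

Within each severity level the treatment arm does better (90% vs 80% for mild, 50% vs 40% for severe), yet the pooled rates reverse (54% vs 76%), because severity acts as the latent variable that differs in proportion between the arms.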

Anscombe’s quartet
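
The point of the quartet is that four very different data sets share nearly identical summary statistics. The sketch below, assuming NumPy, computes the means and the Pearson correlation for each set from the published values; plotting each pair as a scatter plot reveals how different they look.

```python
import numpy as np

# Anscombe's quartet: sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y
    print(f"{name:>3}: mean x = {x.mean():.2f}, mean y = {y.mean():.2f}, r = {r:.3f}")
```

Summary statistics alone cannot distinguish these sets, which is the argument for always plotting the data.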

Summary

We have explored these key points
- Proper use of plot aesthetics enables projection of multiple dimensions of complex data onto the 2-dimensional plot surface.
- All plot aesthetics have limitations which must be understood to use them effectively.
- The effectiveness of a plot aesthetic varies with the data type and the application.

the end